1 Stoat in a nutshell

STOAT is a versatile GWAS (Genome-Wide Association Study) tool designed to work with pangenome graphs.

It supports binary phenotypes both with and without covariates: using Fisher’s exact test or Chi-squared test when no covariates are present, and logistic regression when covariates are included.
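As an illustration of the two covariate-free tests named above, here is a minimal, standard-library-only sketch for a single 2x2 table (cases/controls by path A/path B). It is not STOAT's implementation: the function names, the two-sided Fisher definition, and the absence of a continuity correction in the Chi-squared test are all assumptions made for the example.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared p-value for a 2x2 table (1 degree of freedom)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # An empty row or column carries no information.
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom the survival function is erfc(sqrt(stat / 2)).
    return erfc(sqrt(stat / 2))


# Hypothetical counts: 80/20 path usage in cases versus 50/50 in controls.
print(fisher_exact_two_sided(80, 20, 50, 50), chi2_2x2(80, 20, 50, 50))
```

Both functions only handle two groups and two paths; a snarl with more paths would need a larger contingency table.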

For quantitative traits, STOAT performs linear regression, again with or without covariates. Additionally, it supports eQTL analysis, enabling association testing between genetic variants and gene expression levels.
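The choice of test described above can be summarised as a small decision table. The sketch below only restates the text; `choose_tests` and its argument values are hypothetical names, not part of STOAT's interface, and eQTL is left out because the text does not name the test used for it.

```python
def choose_tests(phenotype_kind, has_covariates):
    """Return the test names the text associates with a phenotype setting."""
    if phenotype_kind == "binary":
        if has_covariates:
            return ["logistic regression"]
        return ["Fisher's exact test", "Chi-squared test"]
    if phenotype_kind == "quantitative":
        # Linear regression is used with or without covariates.
        return ["linear regression"]
    raise ValueError(f"unsupported phenotype kind: {phenotype_kind!r}")


print(choose_tests("binary", has_covariates=False))
```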

It contains two modes:

  • VCF: takes a VCF file as input and tests the snarl variants it contains
  • Graph: tests snarls directly on the pangenome graph

1.1 Stoat version 0.2.0 (08/2025)


  • Stoat:

    • Cartoon Update: Added stoat graph (work in progress).
    • Unified output format (position, snarl_id, type).
  • Stoat VCF:

    • Linear Regression: Corrected near-perfect collinearity case by merging identical columns.

    • Logistic Regression: Corrected output format.

    • Other Fixes:

      • Fixed snarl paths parsing.
      • Fixed eQTL file format parsing (handles cases with large handlegraphs).
  • Stoat Graph:

    • …
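To make the linear-regression fix listed under Stoat VCF more concrete, the sketch below shows one way identical predictor columns can be merged before fitting, which removes exact duplicates that would otherwise make the design matrix collinear. How STOAT actually performs the merge is not described here, so the function name and the dictionary-based representation are assumptions.

```python
def merge_identical_columns(columns):
    """Collapse columns with identical values into a single representative.

    columns: dict mapping a column name to its list of values.
    Returns (merged, groups), where groups maps each kept column name to
    the names of all columns it stands for.
    """
    merged, groups, seen = {}, {}, {}
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            # Same values as an earlier column: record it, do not keep it.
            groups[seen[key]].append(name)
        else:
            seen[key] = name
            merged[name] = list(values)
            groups[name] = [name]
    return merged, groups


# Hypothetical per-sample path counts for three paths of one snarl.
design = {"path_a": [0, 1, 2], "path_b": [0, 1, 2], "path_c": [2, 1, 0]}
merged, groups = merge_identical_columns(design)
print(list(merged), groups)
```

This only handles exactly identical columns; near-identical columns would need a tolerance or a rank check instead.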


1.2 Simulation Description

We constructed three simulated datasets: binary, quantitative, and eQTL.

The binary and quantitative simulations each include their own pangenome graph representing a single chromosome, incorporating various types of variation, such as SNPs, INDELs, and complex variants. Despite the different variations, the graph structure—based on a fork pattern—remains consistent across all simulations.

A fork structure is defined as a boundary snarl with exactly two paths, which can themselves contain nested forks (see fork representation below).
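The fork definition lends itself to a small recursive data structure. The sketch below is only an illustration of that definition; the class name, its fields, and the node numbers are hypothetical and do not come from the simulation code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fork:
    """A snarl with exactly two branches; each branch may hold nested forks."""
    start_node: int
    end_node: int
    left: List["Fork"] = field(default_factory=list)
    right: List["Fork"] = field(default_factory=list)

    def count_forks(self) -> int:
        # This fork plus every fork nested in either branch.
        return 1 + sum(f.count_forks() for f in self.left + self.right)

    def max_depth(self) -> int:
        nested = self.left + self.right
        return 1 + max((f.max_depth() for f in nested), default=0)


# A top-level fork whose left branch contains one nested fork.
inner = Fork(start_node=3, end_node=6)
outer = Fork(start_node=1, end_node=8, left=[inner])
print(outer.count_forks(), outer.max_depth())
```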

Once the pangenome graph was constructed, we simulated individual haplotypes by generating sample genomes that traverse different paths within the graph. To introduce phenotype-genotype relationships, we assigned each haplotype a binary or quantitative phenotype according to specific simulation rules.

The eQTL simulation differs in that it includes 10 chromosomes, but no pangenome graph is constructed. Instead, we directly simulate the path-snarl file required by STOAT, and the variations consist only of simple SNPs.


Type          Samples  Variants (SNP/INDEL/COMPLEX)  Snarls/paths
Binary        200      2444/446/158                  1524/3048
Quantitative  200      2478/382/138                  1499/2998
eQTL          200      200000/0/0                    100000/200000

1.2.1 Binary Simulation

The 200 samples were divided into two cohorts (100 cases and 100 controls), each corresponding to a different phenotypic state containing 1,000 variations. Within each snarl, each group had its own probability of traversing a specific path. These probabilities could be equal between the two cohorts (e.g., 50/50) or skewed in favor of one group (e.g., 20/80), simulating an association between variation and phenotype.
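A minimal sketch of this idea for a single fork is given below. It assumes one draw per haplotype and reads the 50/50 and 20/80 examples as per-group probabilities of taking the first path, which is one possible interpretation; the function name, the group labels, and the seed are hypothetical.

```python
import random


def simulate_fork_choices(n_per_group, p_path_a, seed=0):
    """Draw path choices at one fork for each group.

    p_path_a: dict mapping a group label to its probability of taking path A.
    Returns a dict mapping each group to its counts for paths A and B.
    """
    rng = random.Random(seed)
    counts = {}
    for group, p in p_path_a.items():
        took_a = sum(rng.random() < p for _ in range(n_per_group))
        counts[group] = {"A": took_a, "B": n_per_group - took_a}
    return counts


# No simulated association: both groups take path A with probability 0.5.
null_fork = simulate_fork_choices(100, {"control": 0.5, "case": 0.5})
# Simulated association: the two groups favour different paths.
assoc_fork = simulate_fork_choices(100, {"control": 0.2, "case": 0.8})
print(null_fork, assoc_fork)
```

The resulting counts form the kind of 2x2 table that a Fisher's exact or Chi-squared test would then be applied to.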


1.2.2 Quantitative Simulation

Similar to the binary simulation, 200 samples were divided into two cohorts, with each group having a correlated probability of traversing specific paths within snarls. In this case, an additional probability factor was introduced so that the likelihood of passing through a given path depends on the individual’s phenotype value.
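The exact form of the additional probability factor is not given, so the sketch below makes a simple assumption: the probability of taking a path is shifted linearly by a standardized phenotype value and clipped to the interval [0, 1]. The function names, the `effect` parameter, and the linear form are all hypothetical.

```python
import random


def path_probability(base_prob, effect, phenotype_z):
    """Shift base_prob by effect * phenotype_z and clip the result to [0, 1]."""
    return min(1.0, max(0.0, base_prob + effect * phenotype_z))


def simulate_quantitative_fork(phenotypes, base_prob=0.5, effect=0.2, seed=0):
    """Return one True/False per individual: True if path A was taken."""
    rng = random.Random(seed)
    return [rng.random() < path_probability(base_prob, effect, z)
            for z in phenotypes]


# Hypothetical standardized phenotype values for five individuals.
choices = simulate_quantitative_fork([-2.0, -0.5, 0.0, 0.5, 2.0])
print(choices)
```

Setting `effect` to 0 removes the dependence on the phenotype, which corresponds to a fork with no simulated association.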


1.2.3 eQTL Simulation

In this simulation, we generated 200 samples with 200,000 SNPs and 100 genes. All variants were generated randomly, meaning no significant associations were introduced. The goal was to create a simple simulation to test STOAT’s VCF-based eQTL pipeline.


1.2.4 Covariate Simulation

Covariates were simulated based on the phenotype, except for sex. They include SEX, PC1, PC2, and PC3. The SEX covariate is randomly assigned as male or female, while the three PCs are simulated from a normal distribution.
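A possible sketch of such a covariate table is shown below. The way the PCs depend on the phenotype is not specified, so the example assumes a normal distribution whose mean is shifted by the phenotype value; `pc_shift`, the "M"/"F" coding, and the function name are hypothetical.

```python
import random


def simulate_covariates(phenotypes, pc_shift=0.5, seed=0):
    """Return one covariate row (SEX, PC1, PC2, PC3) per phenotype value."""
    rng = random.Random(seed)
    rows = []
    for y in phenotypes:
        # SEX is drawn independently of the phenotype.
        row = {"SEX": rng.choice(["M", "F"])}
        for pc in ("PC1", "PC2", "PC3"):
            # Assumed link: normal draw centred on pc_shift * phenotype.
            row[pc] = rng.gauss(pc_shift * y, 1.0)
        rows.append(row)
    return rows


covariates = simulate_covariates([0, 0, 1, 1])
print(covariates[0])
```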


1.2.5 Verifying the Truth Simulation Description

1.2.5.1 Frequency File Description

The binary and quantitative simulations produce a file containing the following elements:

  • The frequency probability (freq) of a haplotype, depending on its group, to follow a specific path or edge in a fork.
  • The group associated with each frequency.
  • The start and end nodes of the fork.

If the two groups have identical probabilities on the same edge, the corresponding snarl may appear significant by chance, but it is treated as non-significant in downstream analysis. Conversely, any difference in frequency between the two groups, however small, is treated as significant.

Example freq file:

start_node  next_node   group   freq
2           3           0       0.53
2           3           1       0.53
2           4           0       0.47
2           4           1       0.47
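The rule above can be applied to the example file with a few lines of code. The sketch below assumes whitespace-separated columns, as displayed; the function names are hypothetical.

```python
from collections import defaultdict

FREQ_TEXT = """start_node  next_node   group   freq
2           3           0       0.53
2           3           1       0.53
2           4           0       0.47
2           4           1       0.47
"""


def load_edge_freqs(text):
    """Map each (start_node, next_node) edge to its {group: freq} values."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    freqs = defaultdict(dict)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        edge = (int(row["start_node"]), int(row["next_node"]))
        freqs[edge][int(row["group"])] = float(row["freq"])
    return dict(freqs)


def edge_differs_between_groups(group_freqs):
    """True if the groups do not all share the same frequency on this edge."""
    return len(set(group_freqs.values())) > 1


for edge, group_freqs in load_edge_freqs(FREQ_TEXT).items():
    print(edge, group_freqs, edge_differs_between_groups(group_freqs))
```

In this example both edges carry the same frequency in both groups, so neither would count as a simulated association.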

1.2.5.2 Computing STOAT Output with the Frequency File

To assess whether a snarl is correctly identified as significant, we merge the snarl’s path information with the frequency probabilities from the freq file.

A correct match is a snarl that contains both start/end node pairs in two separate paths (one for each group).
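One possible reading of this matching step is sketched below: a snarl matches a fork when each of the fork's two edges is carried by a different path of the snarl. Both this interpretation and the representation of a path as a list of node IDs are assumptions; the function names are hypothetical.

```python
def path_has_edge(path, edge):
    """True if the two nodes of edge appear consecutively in path."""
    return any((a, b) == edge for a, b in zip(path, path[1:]))


def snarl_matches_fork(snarl_paths, fork_edges):
    """True if the two fork edges are carried by two different snarl paths."""
    edge_a, edge_b = fork_edges
    with_a = {i for i, p in enumerate(snarl_paths) if path_has_edge(p, edge_a)}
    with_b = {i for i, p in enumerate(snarl_paths) if path_has_edge(p, edge_b)}
    # At least one path for edge_a and a distinct one for edge_b are required.
    return any(i != j for i in with_a for j in with_b)


# Hypothetical snarl with two paths through the fork starting at node 2.
paths = [[2, 3, 5], [2, 4, 5]]
print(snarl_matches_fork(paths, ((2, 3), (2, 4))))
```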


1.2.6 Graph Information

  • A snarl is considered true if it contains a fork where at least one edge shows a frequency difference between the two groups.
  • A snarl is considered positive if its p-value is below 0.01.
  • The p-value used for verification in the binary simulation is the CHI2 p-value.
  • The size of the dots in the p-value vs. simulated effect graphs is proportional to the number of haplotypes passing through the snarl.
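The definitions of "true" and "positive" above lead directly to a confusion matrix. The sketch below shows how such counts could be tallied; the function names and the toy input are hypothetical, and only the 0.01 threshold comes from the text.

```python
def confusion_counts(snarls, alpha=0.01):
    """Tally TP/FP/FN/TN from (is_true, p_value) pairs.

    is_true: the snarl contains a fork edge whose frequency differs by group.
    A snarl is called positive when its p-value is strictly below alpha.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_true, p_value in snarls:
        positive = p_value < alpha
        # First letter: was the call correct? Second letter: what was the call?
        key = ("T" if positive == is_true else "F") + ("P" if positive else "N")
        counts[key] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); None where the denominator is zero."""
    called = counts["TP"] + counts["FP"]
    truth = counts["TP"] + counts["FN"]
    precision = counts["TP"] / called if called else None
    recall = counts["TP"] / truth if truth else None
    return precision, recall


toy = [(True, 1e-5), (True, 0.2), (False, 0.005), (False, 0.5)]
counts = confusion_counts(toy)
print(counts, precision_recall(counts))
```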

1.2.7 Binary stoat VCF

## 
## Percentage of paths (present in freq file) tested :  95.99737532808399


1.2.8 Binary stoat covar VCF

## 
## Percentage of paths (present in freq file) tested :  89.56692913385827

The percentage of paths tested is below 100% because snarls are filtered using thresholds on the number of haplotypes and the number of samples present in each snarl.
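A minimal sketch of how such a percentage could be computed is given below. The threshold values and function names are hypothetical, since the actual filter settings are not stated here.

```python
def keep_snarl(n_haplotypes, n_samples, min_haplotypes=2, min_samples=2):
    """Hypothetical filter: keep a snarl only if both counts reach a minimum."""
    return n_haplotypes >= min_haplotypes and n_samples >= min_samples


def percent_tested(freq_paths, tested_paths):
    """Percentage of the paths listed in the freq file that were tested."""
    freq_paths = set(freq_paths)
    if not freq_paths:
        raise ValueError("freq_paths is empty")
    return 100.0 * len(freq_paths & set(tested_paths)) / len(freq_paths)


# Toy example: three of the four simulated paths survive filtering.
print(percent_tested(["p1", "p2", "p3", "p4"], ["p1", "p2", "p3"]))
```

As a consistency check, 2926 tested paths out of the 3048 listed for the binary simulation would give about 95.997%, which agrees with the binary VCF figure if the denominator is the total number of simulated paths (an assumption).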


1.2.9 Binary stoat GRAPH

## 
## Percentage of paths (present in freq file) tested :  98.9501312335958

1.2.10 Compare stoat graph vs stoat vcf binary output

##          Column Num_Different_Rows
## 1           CHR                  0
## 2     START_POS                  0
## 3       END_POS                  0
## 4  PATH_LENGTHS                  0
## 5      P_FISHER               1295
## 6        P_CHI2               1439
## 7    P_ADJUSTED               1449
## 8   GROUP_PATHS               1451
## 9         DEPTH                  0
## 10          POS                  0

These differences can be explained by the fact that stoat graph treats the ‘ref’ haplotype as the correct haplotype, whereas the simulation does not.
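A per-column comparison like the one tabulated above can be sketched as follows. The example assumes both outputs have been loaded as lists of dictionaries (for instance with `csv.DictReader`) and that their rows are in the same order; the function name and the toy rows are hypothetical.

```python
def count_different_rows(rows_a, rows_b, columns):
    """Count, for each column, the rows whose values differ between outputs."""
    if len(rows_a) != len(rows_b):
        raise ValueError("both outputs must have the same number of rows")
    return {
        col: sum(a[col] != b[col] for a, b in zip(rows_a, rows_b))
        for col in columns
    }


# Toy rows: positions agree, one Chi-squared p-value differs.
vcf_rows = [{"CHR": "1", "START_POS": "10", "P_CHI2": "0.50"},
            {"CHR": "1", "START_POS": "42", "P_CHI2": "0.03"}]
graph_rows = [{"CHR": "1", "START_POS": "10", "P_CHI2": "0.50"},
              {"CHR": "1", "START_POS": "42", "P_CHI2": "0.01"}]
print(count_different_rows(vcf_rows, graph_rows, ["CHR", "START_POS", "P_CHI2"]))
```

Values are compared as they appear in the files, so p-values that differ only in formatting or rounding would also be counted as different.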


1.2.11 Quantitative stoat VCF

## 
## Percentage of paths (present in freq file) tested :  90.86057371581055


1.2.12 Quantitative covar stoat VCF

## 
## Percentage of paths (present in freq file) tested :  90.86057371581055


1.2.13 eQTL stoat VCF


1.2.14 eQTL covar stoat VCF